
    The Blame Game: Performance Analysis of Speaker Diarization System Components

    In this paper we discuss the performance analysis of a speaker diarization system similar to the one submitted by ICSI to the NIST RT06s evaluation benchmark. The analysis, which is based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the speech activity detection component contributes most to the total diarization error rate (23%). The inability to model overlapping speech is another large source of errors (22%), followed by the component that creates the initial system models (15%).
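
    The oracle methodology described in the abstract can be sketched as simple bookkeeping: replace one automatic component at a time with a reference ("oracle") version, rescore, and credit the drop in diarization error rate (DER) to that component. The function and the DER numbers below are hypothetical illustrations, not the authors' scoring code.

# A minimal sketch of oracle-style error attribution. All DER values are
# hypothetical and only illustrate the arithmetic, not the paper's results.

def attribute_error(baseline_der, oracle_ders):
    """Return each component's share of the baseline DER.

    baseline_der : DER of the fully automatic system (e.g. 0.25 for 25%).
    oracle_ders  : mapping component -> DER obtained when that component
                   is replaced by an oracle (reference) version.
    """
    contributions = {}
    for component, der in oracle_ders.items():
        # Error attributed to a component = how much DER falls when that
        # component is made perfect, relative to the baseline DER.
        contributions[component] = (baseline_der - der) / baseline_der
    return contributions


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    baseline = 0.25
    oracle = {
        "speech_activity_detection": 0.19,   # oracle SAD output
        "overlap_handling": 0.20,            # oracle overlap labelling
        "initial_models": 0.21,              # oracle initialization
    }
    for name, share in attribute_error(baseline, oracle).items():
        print(f"{name}: {share:.0%} of baseline DER")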

    Filtering the Unknown: Speech Activity Detection in Heterogeneous Video Collections

    In this paper we discuss the speech activity detection system that we used to detect speech regions in the Dutch TRECVID video collection. The system is designed to filter non-speech, such as music or sound effects, out of the signal without the use of predefined non-speech models. Because the system trains its models on-line, it is robust to out-of-domain data. The speech activity error rate on an out-of-domain test set, recordings of English conference meetings, was 4.4%. The overall error rate on twelve randomly selected five-minute TRECVID fragments was 11.5%.
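
    As a rough illustration of the general idea of training speech/non-speech models on-line from the recording itself (this is not the system described above), the sketch below bootstraps labels from frame energy, fits one Gaussian per class, and re-labels the frames. Frame extraction, the choice of feature, and the lack of temporal smoothing are all simplifying assumptions.

# A minimal sketch of on-line speech activity detection: no predefined
# non-speech models, only models estimated from the recording at hand.
import numpy as np

def online_sad(frame_energies, bootstrap_quantile=0.3, iterations=3):
    """Label frames as speech (True) or non-speech (False)."""
    e = np.asarray(frame_energies, dtype=float)
    # Bootstrap: call the lowest-energy frames non-speech, the rest speech.
    threshold = np.quantile(e, bootstrap_quantile)
    labels = e > threshold
    for _ in range(iterations):
        if labels.all() or not labels.any():
            break   # one class became empty; keep the current labelling
        # Fit a single Gaussian per class on the current labelling.
        mu_s, sd_s = e[labels].mean(), e[labels].std() + 1e-6
        mu_n, sd_n = e[~labels].mean(), e[~labels].std() + 1e-6
        # Reassign each frame to the class with the higher log-likelihood.
        ll_s = -0.5 * ((e - mu_s) / sd_s) ** 2 - np.log(sd_s)
        ll_n = -0.5 * ((e - mu_n) / sd_n) ** 2 - np.log(sd_n)
        labels = ll_s > ll_n
    return labels

if __name__ == "__main__":
    # Synthetic frame energies: quiet background plus louder "speech".
    rng = np.random.default_rng(0)
    energies = np.concatenate([rng.normal(-8, 1, 500),   # non-speech frames
                               rng.normal(-2, 1, 500)])  # speech-like frames
    speech = online_sad(energies)
    print(f"speech frames: {speech.sum()} / {len(speech)}")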

    Acoustic Beamforming for Speaker Diarization of Meetings


    Speaker Diarization For Multiple-Distant-Microphone Meetings Using Several Sources of Information


    Automatic Cluster Complexity and Quantity Selection: Towards Robust Speaker Diarization

    Abstract. The goal of speaker diarization is to determine where each participant speaks in a recording. One of the most commonly used techniques is agglomerative clustering, where some number of initial models is merged down to the number of speakers present. The choice of complexity, topology, and the number of initial models is vital to the final outcome of the clustering algorithm. In prior systems, these parameters were assigned directly based on development data and were the same for all recordings. In this paper we present three techniques to select the parameters individually for each recording, obtaining a system that is more robust to changes in the data. Although the choice of these values depends on tunable parameters, they are less sensitive to changes in the acoustic data and to how the algorithm distributes data among the different clusters. We show that by using the three techniques we achieve an improvement of up to 8% relative on the development set and 19% relative on the test set over prior systems.
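
    The agglomerative step the abstract refers to can be illustrated with a minimal sketch: start from a fixed number of initial clusters and greedily merge the pair whose merge most lowers a BIC-style criterion, stopping when no merge helps. This is only the generic baseline loop under simplified assumptions (one full-covariance Gaussian per cluster, contiguous initial chunks); it is not the paper's parameter-selection techniques.

# A minimal sketch of agglomerative clustering with a BIC-style merge
# criterion, under simplified assumptions (single Gaussian per cluster).
import numpy as np

def delta_bic(x, y, penalty=1.0):
    """BIC change from merging two clusters of feature vectors x and y.
    Negative values favour the merge."""
    d = x.shape[1]
    n1, n2, n = len(x), len(y), len(x) + len(y)
    def logdet(z):
        cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]
    p = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)   # model-size penalty
    return (0.5 * n * logdet(np.vstack([x, y]))
            - 0.5 * n1 * logdet(x)
            - 0.5 * n2 * logdet(y)
            - penalty * p)

def agglomerate(features, n_initial=8):
    """Split frames into n_initial contiguous chunks, then keep merging
    the pair of clusters whose merge lowers BIC the most."""
    clusters = np.array_split(features, n_initial)
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [delta_bic(clusters[i], clusters[j]) for i, j in pairs]
        best = int(np.argmin(scores))
        if scores[best] >= 0:        # no merge improves BIC -> stop
            break
        i, j = pairs[best]
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters

if __name__ == "__main__":
    # Two synthetic "speakers" with different feature means.
    rng = np.random.default_rng(1)
    feats = np.vstack([rng.normal(0.0, 1.0, (300, 4)),
                       rng.normal(3.0, 1.0, (300, 4))])
    print("clusters found:", len(agglomerate(feats, n_initial=8)))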

    The ICSI Meeting Corpus

    We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as the rationale for the decisions that led to its configuration. The corpus has been delivered to the Linguistic Data Consortium (LDC).